This report explores a dataset containing the quality scores and chemical attributes of 1599 red wines. Our goal is to find out which attributes influence the quality of the wine.
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality
## Min. : 8.40 Min. :3.000
## 1st Qu.: 9.50 1st Qu.:5.000
## Median :10.20 Median :6.000
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Wine data has 1599 observations and 13 variables.
Based on the plot, most of the red wines are of average quality. To better understand this, I created a new variable ‘quality.score’ that maps the numeric quality score to Poor, Average, Good and Excellent. A score of 4 or below is “Poor”, a score of 5 is “Average”, 6 is “Good” and anything above 6 is “Excellent”.
# Create a new variable to represent wine quality as a category
winedata$quality.score <- ifelse(winedata$quality <= 4, "Poor",
                          ifelse(winedata$quality == 5, "Average",
                          ifelse(winedata$quality == 6, "Good",
                                 "Excellent")))
# Set the order of the factor levels
winedata$quality.score <- ordered(winedata$quality.score,
                                  levels = c("Poor", "Average", "Good", "Excellent"))
##
## Poor Average Good Excellent
## 63 681 638 217
We can see that most of the wines are of average or good quality. Excellent and poor wines are fewer in number.
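The same score-to-category mapping can also be sketched with `cut()`, which bins a numeric vector and returns an ordered factor in one step. This is an alternative to the nested `ifelse()` shown earlier, not the code used for this report, and the `quality` vector below is a small made-up sample for illustration only.

```r
# Hypothetical sample of quality scores (not the real dataset)
quality <- c(3, 4, 5, 5, 6, 6, 7, 8)

# Right-closed bins: (-Inf,4] Poor, (4,5] Average, (5,6] Good, (6,Inf] Excellent
quality.score <- cut(quality,
                     breaks = c(-Inf, 4, 5, 6, Inf),
                     labels = c("Poor", "Average", "Good", "Excellent"),
                     ordered_result = TRUE)

table(quality.score)
```

Because the bin edges are listed in one place, changing a threshold only requires editing the `breaks` vector.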
Let's look at a summary of the alcohol content of the wines.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.40 9.50 10.20 10.42 11.10 14.90
The mean alcohol content is 10.42%. The minimum alcohol content in any wine is 8.40%.
Alcohol content is skewed right, so wines with high alcohol content are fewer in number.
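As a rough numerical check of the right skew, the quartiles reported in the summary can be plugged into Bowley's quartile skewness, and the mean can be compared with the median. This is only a sketch based on the printed summary values. It is not how the skew was assessed in this report, which relied on the histogram.

```r
# Summary values for alcohol, copied from the output above
q1  <- 9.50
med <- 10.20
q3  <- 11.10
avg <- 10.42

# Bowley (quartile) skewness: values above 0 suggest a longer right tail
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

# A mean above the median is another common sign of right skew
mean_above_median <- avg > med

bowley
mean_above_median
```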
Let's look at a summary of fixed acidity.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.60 7.10 7.90 8.32 9.20 15.90
Fixed acidity is positively skewed, and the middle half of the wines have a fixed acidity between 7.10 g/dm3 and 9.20 g/dm3. The mean fixed acidity is 8.32 g/dm3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1200 0.3900 0.5200 0.5278 0.6400 1.5800
Volatile acidity is skewed right. The mean volatile acidity is 0.53 g/dm3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.090 0.260 0.271 0.420 1.000
Citric acid appears to have a bimodal distribution without a log transformation. With a log transformation it is negatively skewed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.900 1.900 2.200 2.539 2.600 15.500
The middle half of the wines have residual sugar between 1.9 g/dm3 and 2.6 g/dm3. Without a log transformation, residual sugar has a right-skewed distribution. After a log transformation, it is close to normally distributed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
Chlorides are approximately normally distributed after a log10 transformation. The middle half of the chloride values fall between 0.07 g/dm3 and 0.09 g/dm3.
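The effect of a log10 transformation on a right-skewed variable can be illustrated with synthetic data. The sketch below draws log-normal values (an assumption chosen only because they are right-skewed by construction; they are not the wine measurements) and compares a moment-based skewness before and after the transformation. The helper `skewness()` is defined here for illustration and does not come from a package.

```r
set.seed(42)

# Synthetic right-skewed values (log-normal); NOT the wine data
x <- rlnorm(2000, meanlog = log(0.08), sdlog = 0.5)

# Moment-based sample skewness: mean of the cubed z-scores
skewness <- function(v) {
  z <- (v - mean(v)) / sd(v)
  mean(z^3)
}

skew_raw <- skewness(x)         # clearly positive for right-skewed data
skew_log <- skewness(log10(x))  # close to 0 after the transformation

c(raw = skew_raw, log10 = skew_log)
```

A skewness near zero on the log scale is consistent with an approximately normal shape, which is the pattern described for chlorides, residual sugar, total sulfur dioxide and sulphates.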
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 7.00 14.00 15.87 21.00 72.00
Free sulfur dioxide has a right-skewed distribution. The middle half of the wines have free sulfur dioxide between 7 mg/dm3 and 21 mg/dm3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6.00 22.00 38.00 46.47 62.00 289.00
The log-transformed total sulfur dioxide is approximately normally distributed. The mean total sulfur dioxide is 46.47 mg/dm3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9901 0.9956 0.9968 0.9967 0.9978 1.0040
Density is normally distributed. Mean density is 0.997 g/cm3.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.740 3.210 3.310 3.311 3.400 4.010
pH is normally distributed, with a mean of 3.31.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3300 0.5500 0.6200 0.6581 0.7300 2.0000
Sulphates are approximately normally distributed after a log transformation. The middle half of the wines have sulphates between 0.55 g/dm3 and 0.73 g/dm3.
There are 1599 observations with 13 variables.
Main feature of interest is wine quality.
I think acids, sugars and alcohol will mostly drive the quality of the wine. We will explore all features to find out exactly what affects wine quality.
Yes. A new variable named ‘quality.score’ was created, which categorises the numerical wine quality as ‘Poor’, ‘Average’, ‘Good’ or ‘Excellent’.
Yes, citric acid has an unusual distribution. The structure of the data was not changed.
Based on the plot matrix, the feature most correlated with wine quality is alcohol, followed by volatile acidity, sulphates and citric acid.
Below is the correlation value of each feature with quality.
Alcohol : 0.48
Volatile Acidity : -0.39
Sulphates : 0.25
Citric Acid : 0.23
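A ranking like the one above can be produced by computing the Pearson correlation of every feature with quality and sorting by absolute value. The sketch below uses a tiny made-up data frame named `toy` (hypothetical values, not the real dataset), so its numbers will not match the correlations listed above.

```r
# Tiny made-up data frame for illustration (not the real dataset)
toy <- data.frame(
  quality          = c(3, 4, 5, 5, 6, 6, 7, 8),
  alcohol          = c(9.0, 9.4, 9.6, 10.1, 10.8, 10.5, 11.9, 12.6),
  volatile.acidity = c(1.00, 0.85, 0.70, 0.66, 0.52, 0.58, 0.40, 0.35)
)

# Every column except the target
features <- setdiff(names(toy), "quality")

# Pearson correlation of each feature with quality
cors <- sapply(toy[features], function(col) cor(col, toy$quality))

# Strongest relationships first, regardless of sign
cors[order(abs(cors), decreasing = TRUE)]
```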
Alcohol Correlation with Quality
##
## Pearson's product-moment correlation
##
## data: winedata$quality and winedata$alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4373540 0.5132081
## sample estimates:
## cor
## 0.4761663
We can see that alcohol has a positive linear correlation with wine quality. Wines with high alcohol content are fewer in number. Maybe these wines fall in the Excellent quality category. Let's find out more about this later.
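The t statistic and the confidence interval in the alcohol output above can be reproduced by hand from the reported correlation and sample size. The sketch below uses the standard t formula for a Pearson correlation and the Fisher z-transformation for the interval. The variable names are chosen here for illustration.

```r
r   <- 0.4761663  # sample correlation from the output above
n   <- 1599       # number of wines
dof <- n - 2      # degrees of freedom, 1597 in the output

# t statistic for the null hypothesis that the true correlation is 0
t_stat <- r * sqrt(dof) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

round(t_stat, 3)
round(ci, 4)
```

The results should closely match the `t` value and the 95 percent confidence interval printed by `cor.test()`. The same calculation applies to the volatile acidity, sulphates and citric acid tests.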
Volatile Acidity Correlation with Quality
##
## Pearson's product-moment correlation
##
## data: winedata$quality and winedata$volatile.acidity
## t = -16.954, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4313210 -0.3482032
## sample estimates:
## cor
## -0.3905578
From the above plot we can see that high quality wines have low volatile acidity (acetic acid). There is a clear negative linear relationship between volatile acidity and wine quality.
Sulphates Correlation with Quality
##
## Pearson's product-moment correlation
##
## data: winedata$quality and winedata$sulphates
## t = 10.38, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2049011 0.2967610
## sample estimates:
## cor
## 0.2513971
With some exceptions, wine quality increases as sulphates (potassium sulphate) increase. The main exception is that some lower-scoring wines (scores 4-5) have either high or low sulphates without any change in quality.
Citric Acid Correlation with Quality
##
## Pearson's product-moment correlation
##
## data: winedata$quality and winedata$citric.acid
## t = 9.2875, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1793415 0.2723711
## sample estimates:
## cor
## 0.2263725
Citric acid has a small correlation with wine quality.
Alcohol content varies with density. Low density wine has high alcohol content.
There is a strong correlation between pH and fixed acidity.
Alcohol content has a positive linear relationship with wine quality. Volatile acidity has a negative correlation with wine quality. Most of the average and excellent quality wines have low volatile acidity.
Alcohol and volatile acidity are the main features affecting wine quality. Sulphates and citric acid also affect wine quality to a lesser degree.
Alcohol content varies with density: low density wines have high alcohol content. Volatile acidity has a negative correlation with citric acid, and citric acid has a positive correlation with fixed acidity and a negative correlation with pH.
Yes. Total sulfur dioxide has a positive correlation with free sulfur dioxide. pH has a negative correlation with fixed acidity. Density has a positive correlation with fixed acidity and residual sugar.
There is a strong correlation between pH and fixed acidity, but neither pH nor fixed acidity correlates well with wine quality.
Citric acid and fixed acidity have a strong correlation. Fixed acidity also has a strong correlation with density.
Most of the good and excellent wines have low volatile acidity and a high percentage of alcohol.
There is an interesting relationship between sulphates and alcohol. Most of the excellent and good quality wines have a high percentage of alcohol and a high level of sulphates.
A large number of good and excellent wines have a high percentage of alcohol and a high level of citric acid.
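Patterns like these can also be summarised numerically by comparing a typical value of each attribute across the quality categories. The sketch below computes group medians with `aggregate()` on a small made-up data frame named `toy` (hypothetical values, not the real dataset). It complements the multivariate plots and does not replace them.

```r
# Small made-up data frame for illustration (not the real dataset)
toy <- data.frame(
  quality.score = factor(
    c("Poor", "Poor", "Average", "Average",
      "Good", "Good", "Excellent", "Excellent"),
    levels = c("Poor", "Average", "Good", "Excellent"),
    ordered = TRUE
  ),
  alcohol          = c(9.2, 9.6, 9.5, 9.9, 10.4, 11.0, 11.6, 12.4),
  volatile.acidity = c(0.90, 0.80, 0.62, 0.58, 0.50, 0.46, 0.38, 0.34)
)

# Median alcohol and volatile acidity within each quality category
group_medians <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.score,
                           data = toy, FUN = median)

group_medians
```

With the real data, rising alcohol medians and falling volatile acidity medians from Poor to Excellent would support the relationships described above.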
From the multivariate plots it's clear that volatile acidity, sulphates and citric acid also drive the quality of wine, after alcohol.
High alcohol content combined with low volatile acidity is associated with higher wine quality. High alcohol content combined with high sulphates is also associated with higher quality. A large number of good and excellent wines also have a high percentage of alcohol and a high level of citric acid.
Yes, there is an interesting relationship between sulphates and alcohol. Most of the excellent and good quality wines have a high percentage of alcohol and a high level of sulphates.
Alcohol content is one of the main attributes that influence the quality of the wine. The mean alcohol content is 10.42%.
Most of the good and excellent quality wines have a high percentage of alcohol and low volatile acidity. The quality of the wine depends heavily on these two attributes.
Most of the excellent and good wines have a higher percentage of alcohol together with higher sulphates.
The wine data has 1599 samples and 13 features. Our goal was to find out which attributes influence the quality of the wine.
In the sample dataset, the highest quality score was 8. Few wines had high quality scores: the combined number of wines with a score of 7 or 8 was only 217 out of 1599 samples. With so few high quality wines, it was a bit difficult to find the attributes that influence quality.
So I decided to categorise the wines based on their score. I created four categories: “Excellent”, “Good”, “Average” and “Poor”. Scores of 7 and 8 fall under “Excellent”, a score of 6 falls under “Good”, 5 under “Average” and scores below 5 under “Poor”. This made a large difference, as 855 wines fall under “Good” and “Excellent” combined.
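The arithmetic behind these counts can be checked directly from the category table printed earlier in the report. The sketch below re-enters those four counts by hand and converts them to shares of the 1599 samples.

```r
# Counts per category, copied from the table() output earlier in the report
counts <- c(Poor = 63, Average = 681, Good = 638, Excellent = 217)

total          <- sum(counts)                          # 1599
good_or_better <- sum(counts[c("Good", "Excellent")])  # 855

# Share of each category, and of Good and Excellent combined, in percent
round(100 * counts / total, 1)
round(100 * good_or_better / total, 1)
```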
The distributions of alcohol, fixed acidity, volatile acidity and free sulfur dioxide are skewed right. Citric acid has a bimodal distribution; with a log10 transformation, it is negatively skewed. Density and pH are normally distributed. After a log transformation, residual sugar, chlorides, total sulfur dioxide and sulphates are approximately normally distributed.
From the plot matrix, I found that alcohol, volatile acidity, sulphates and citric acid are the main attributes that influence the quality of the wine.
From the bivariate plots, I found that most of the good and excellent quality wines have a higher percentage of alcohol. Volatile acidity has a negative correlation with wine quality: as wine quality increases, volatile acidity decreases. Wine quality also has a small positive correlation with sulphates and citric acid.
From the multivariate analysis, I found that most of the good and excellent wines have a higher percentage of alcohol, lower volatile acidity and higher sulphates.
For future analysis, it would be interesting to add “viticulture” and “vinification” data to the dataset. I read on a site that these factors strongly affect the quality of the wine. They will be hard to measure, but having that data would make the analysis more interesting.